
    One Table to Count Them All: Parallel Frequency Estimation on Single-Board Computers

    Sketches are probabilistic data structures that can provide approximate results within mathematically proven error bounds while using orders of magnitude less memory than traditional approaches. They are tailored for streaming data analysis on architectures with limited memory, such as the single-board computers widely used for IoT and edge computing. Since these devices offer multiple cores, they can handle high-volume data streams provided the sketching scheme is parallelized efficiently. However, because their caches are relatively small, careful parallelization is required. In this work, we focus on the frequency estimation problem and evaluate the performance of a high-end server, a 4-core Raspberry Pi, and an 8-core Odroid. As the sketch, we employed the widely used Count-Min Sketch. To hash the stream in parallel and in a cache-friendly way, we applied a novel tabulation approach and rearranged the auxiliary tables into a single one. To parallelize the process without losing performance, we modified the workflow and applied a form of buffering between hash computations and sketch updates. Today, many single-board computers have heterogeneous processors in which slow and fast cores are combined. To utilize all these cores to their full potential, we proposed a dynamic load-balancing mechanism which significantly increased the performance of frequency estimation. Comment: 12 pages, 4 figures, 3 algorithms, 1 table, submitted to EuroPar'1
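
    The abstract's building blocks (a Count-Min Sketch updated through tabulation hashing with the lookup tables merged into one array) can be pictured with the minimal, single-threaded Python sketch below. The class name, the 32-bit key assumption and the parameters are illustrative only, and the paper's buffering and dynamic load-balancing stages are not shown.

        # Minimal Count-Min Sketch with simple tabulation hashing; the per-row,
        # per-byte lookup tables are packed into one flat array, loosely echoing
        # the "one table" idea. Illustrative sketch, not the paper's code.
        import random

        class CountMinSketch:
            def __init__(self, depth=4, width=1 << 16, seed=42):
                rng = random.Random(seed)
                self.depth, self.width = depth, width
                self.counts = [[0] * width for _ in range(depth)]
                # depth rows x 4 key bytes x 256 random 32-bit words
                self.table = [rng.getrandbits(32) for _ in range(depth * 4 * 256)]

            def _hash(self, row, key):            # key assumed to fit in 32 bits
                h, base = 0, row * 4 * 256
                for byte_idx in range(4):
                    b = (key >> (8 * byte_idx)) & 0xFF
                    h ^= self.table[base + byte_idx * 256 + b]
                return h % self.width

            def update(self, key, count=1):
                for row in range(self.depth):
                    self.counts[row][self._hash(row, key)] += count

            def estimate(self, key):
                return min(self.counts[row][self._hash(row, key)]
                           for row in range(self.depth))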

    An Improved Interactive Streaming Algorithm for the Distinct Elements Problem

    The exact computation of the number of distinct elements (frequency moment $F_0$) is a fundamental problem in the study of data streaming algorithms. We denote the length of the stream by $n$, where each symbol is drawn from a universe of size $m$. While it is well known that the moments $F_0, F_1, F_2$ can be approximated by efficient streaming algorithms, it is easy to see that exact computation of $F_0$ and $F_2$ requires space $\Omega(m)$. In previous work, Cormode et al. therefore considered a model where the data stream is also processed by a powerful helper, who provides an interactive proof of the result. They gave such protocols with a polylogarithmic number of rounds of communication between helper and verifier for all functions in NC. This number of rounds ($O(\log^2 m)$ in the case of $F_0$) can quickly make such protocols impractical. Cormode et al. also gave a protocol with $\log m + 1$ rounds for the exact computation of $F_0$, where the space complexity is $O(\log m \log n + \log^2 m)$ but the total communication is $O(\sqrt{n}\,\log m\,(\log n + \log m))$. They managed to give $\log m$-round protocols with $\operatorname{polylog}(m,n)$ complexity for many other interesting problems, including $F_2$, Inner Product, and Range-Sum, but computing $F_0$ exactly with polylogarithmic space and communication and $O(\log m)$ rounds remained open. In this work, we give a streaming interactive protocol with $\log m$ rounds for exact computation of $F_0$ using $O(\log m\,(\log n + \log m \log\log m))$ bits of space, and the communication is $O(\log m\,(\log n + \log^3 m (\log\log m)^2))$. The update time of the verifier per symbol received is $O(\log^2 m)$. Comment: Submitted to ICALP 201
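
    To make the $\Omega(m)$ space requirement for exact $F_0$ concrete, the toy tracker below keeps one bit per universe element. It is purely illustrative and is not the paper's interactive verifier-helper protocol; all names are assumptions.

        # Exact distinct-element counting in the plain streaming model:
        # one bit per universe element, i.e. Theta(m) bits of state.
        class ExactF0:
            def __init__(self, m):
                self.seen = bytearray((m + 7) // 8)   # bitmap over the universe
                self.count = 0

            def update(self, x):                      # x in [0, m)
                byte, bit = divmod(x, 8)
                if not (self.seen[byte] >> bit) & 1:
                    self.seen[byte] |= 1 << bit
                    self.count += 1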

    Almost Optimal Streaming Algorithms for Coverage Problems

    Maximum coverage and minimum set cover problems -- collectively called coverage problems -- have been studied extensively in streaming models. However, previous work not only achieves sub-optimal approximation factors and space complexities, but also studies a restricted set-arrival model that makes an explicit or implicit assumption of oracle access to the sets, ignoring the complexity of reading and storing a whole set at once. In this paper, we address the above shortcomings and present algorithms with improved approximation factors and improved space complexity, and prove that our results are almost tight. Moreover, unlike most previous work, our results hold in a more general edge-arrival model. More specifically, we present (almost) optimal approximation algorithms for maximum coverage and minimum set cover problems in the streaming model with an (almost) optimal space complexity of $\tilde{O}(n)$, i.e., the space is independent of the size of the sets or the size of the ground set of elements. These results not only improve over the best known algorithms for the set-arrival model, but also are the first such algorithms for the more powerful edge-arrival model. In order to achieve the above results, we introduce a new general sketching technique for coverage functions: this sketching scheme can be applied to convert an $\alpha$-approximation algorithm for a coverage problem into a $(1-\epsilon)\alpha$-approximation algorithm for the same problem in the streaming or RAM models. We show the significance of our sketching technique by ruling out the possibility of solving coverage problems via access (as a black box) to a $(1\pm\epsilon)$-approximate oracle (e.g., a sketch function) that estimates the coverage function on any subfamily of the sets.
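
    One rough way to picture a coverage sketch in the edge-arrival model is element subsampling: keep each ground-set element with probability p, store only the kept elements per set, and rescale coverage estimates by 1/p. The toy below is a stand-in under that assumption, not the paper's sketching scheme; all names are invented for illustration.

        # Toy coverage sketch: subsample the ground set once, then estimate the
        # coverage of any subfamily from the sampled elements, rescaled by 1/p.
        import random

        def sketch_coverage(edge_stream, p=0.1, seed=0):
            rng = random.Random(seed)
            kept = {}                  # element -> sampling decision (made once)
            sets = {}                  # set id -> sampled elements it covers
            for set_id, element in edge_stream:    # edge arrival: (set, element)
                if element not in kept:
                    kept[element] = rng.random() < p
                if kept[element]:
                    sets.setdefault(set_id, set()).add(element)
            return sets, p

        def estimate_coverage(sets, p, chosen):
            covered = set()
            for s in chosen:                       # union of the sampled sets
                covered |= sets.get(s, set())
            return len(covered) / p                # rescaled coverage estimate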

    Densest Subgraph in Dynamic Graph Streams

    In this paper, we consider the problem of approximating the densest subgraph in the dynamic graph stream model. In this model of computation, the input graph is defined by an arbitrary sequence of edge insertions and deletions, and the goal is to analyze properties of the resulting graph given memory that is sub-linear in the size of the stream. We present a single-pass algorithm that returns a $(1+\epsilon)$ approximation of the maximum density with high probability; the algorithm uses $O(\epsilon^{-2}\, n\, \operatorname{polylog} n)$ space, processes each stream update in $\operatorname{polylog}(n)$ time, and uses $\operatorname{poly}(n)$ post-processing time, where $n$ is the number of nodes. The space used by our algorithm matches the lower bound of Bahmani et al. (PVLDB 2012) up to a poly-logarithmic factor for constant $\epsilon$. The best existing results for this problem were established recently by Bhattacharya et al. (STOC 2015). They presented a $(2+\epsilon)$ approximation algorithm using similar space, and another algorithm that both processed each update and maintained a $(4+\epsilon)$ approximation of the current maximum density in $\operatorname{polylog}(n)$ time per update. Comment: To appear in MFCS 201
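
    For intuition about "maximum density", the classic greedy peeling routine below (Charikar's 2-approximation) is the kind of post-processing one could run on a stored or sampled subgraph. It is only an illustration: none of the paper's dynamic-stream sampling machinery, which must also handle edge deletions, appears here.

        # Greedy peeling: repeatedly remove a minimum-degree vertex and remember
        # the best density (edges / nodes) seen along the way (2-approximation).
        from collections import defaultdict

        def densest_subgraph_density(edges):
            adj = defaultdict(set)
            for u, v in edges:                    # assumes a simple graph
                adj[u].add(v)
                adj[v].add(u)
            nodes = set(adj)
            m = sum(len(neigh) for neigh in adj.values()) // 2
            best = m / len(nodes) if nodes else 0.0
            while nodes:
                v = min(nodes, key=lambda x: len(adj[x]))
                for u in adj[v]:
                    adj[u].discard(v)
                m -= len(adj[v])
                nodes.remove(v)
                del adj[v]
                if nodes:
                    best = max(best, m / len(nodes))
            return best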

    Addressing Item-Cold Start Problem in Recommendation Systems using Model Based Approach and Deep Learning

    Traditional recommendation systems rely on past usage data in order to generate new recommendations. Such approaches fail to generate sensible recommendations for new users and items entering the system because of the missing information about their past interactions. In this paper, we propose a solution for successfully addressing the item cold-start problem, which uses a model-based approach and recent advances in deep learning. In particular, we use a latent factor model for recommendation and predict the latent factors from items' descriptions using a convolutional neural network when they cannot be obtained from usage data. Latent factors obtained by applying matrix factorization to the available usage data are used as ground truth to train the convolutional neural network. To create latent factor representations for the new items, the convolutional neural network uses their textual descriptions. The results from the experiments reveal that the proposed approach significantly outperforms several baseline estimators.
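
    A hedged PyTorch sketch of the described pipeline: a small text CNN maps an item's description (as token ids) to its latent-factor vector and is trained with a mean-squared-error loss against factors produced by matrix factorization. The architecture, dimensions and names are assumptions for illustration, not the paper's exact network.

        # Text CNN regressing item descriptions onto latent factors (assumed sizes).
        import torch
        import torch.nn as nn

        class TextToFactors(nn.Module):
            def __init__(self, vocab_size=10000, emb_dim=64,
                         n_filters=128, kernel_size=3, latent_dim=40):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb_dim)
                self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size, padding=1)
                self.out = nn.Linear(n_filters, latent_dim)

            def forward(self, token_ids):                 # (batch, seq_len)
                x = self.emb(token_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
                x = torch.relu(self.conv(x))
                x = x.max(dim=2).values                   # max-pool over positions
                return self.out(x)                        # predicted latent factors

        # Training step against matrix-factorization factors (the ground truth):
        #   loss = nn.MSELoss()(model(token_ids), mf_factors)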

    Space-optimal Heavy Hitters with Strong Error Bounds

    The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worst-case guarantees on real data. This leads to the question of whether stronger bounds can be guaranteed. We answer this in the positive by showing that a class of "counter-based algorithms" (including the popular and very space-efficient FREQUENT and SPACESAVING algorithms) provides much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining "tail." This shows that counter-based methods are the most space-efficient (in fact, space-optimal) algorithms having this strong error bound. This tail guarantee allows these algorithms to solve the "sparse recovery" problem. Here, the goal is to recover a faithful representation of the vector of frequencies $f$. We prove that, using space $O(k)$, the algorithms construct an approximation $f^*$ to the frequency vector $f$ so that the $L_1$ error $\|f - f^*\|_1$ is close to the best possible error $\min_{f_2} \|f_2 - f\|_1$, where $f_2$ ranges over all vectors with at most $k$ non-zero entries. This improves the previously best known space bound of about $O(k \log n)$ for streams without element deletions (where $n$ is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantees are results for skewed (Zipfian) data and guarantees for the accuracy of merging multiple summarized streams. Funding: David & Lucile Packard Foundation (Fellowship); Center for Massive Data Algorithmics (MADALGO); National Science Foundation (U.S.) (Grant number CCF-0728645
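
    For reference, here is a minimal Python version of SPACESAVING, one of the counter-based algorithms the result applies to: at most k counters are kept, and the count inherited on eviction is exactly the overestimate that the tail bound controls. This is a sketch for illustration, not the paper's implementation.

        class SpaceSaving:
            def __init__(self, k):
                self.k = k
                self.counters = {}            # item -> estimated count

            def update(self, item):
                if item in self.counters:
                    self.counters[item] += 1
                elif len(self.counters) < self.k:
                    self.counters[item] = 1
                else:
                    # Evict the minimum counter; the new item inherits its count.
                    victim = min(self.counters, key=self.counters.get)
                    self.counters[item] = self.counters.pop(victim) + 1

            def estimate(self, item):
                return self.counters.get(item, 0)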

    Online Self-Indexed Grammar Compression

    Although several grammar-based self-indexes have been proposed thus far, their applicability is limited to offline settings where the whole input text is available in advance, so the index structures must be rebuilt whenever additional input arrives, which is often the case in the big data era. In this paper, we present the first online self-indexed grammar compression, named OESP-index, which can gradually build the index structure by reading input characters one by one. This property has the further advantage of saving working space during construction, because we do not need to store the input text in memory. We experimentally test OESP-index on its ability to build index structures and search query texts, and we show OESP-index's efficiency, especially its space efficiency for building index structures. Comment: To appear in the Proceedings of the 22nd edition of the International Symposium on String Processing and Information Retrieval (SPIRE 2015)
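
    OESP-index builds its grammar incrementally as characters arrive; the toy below only conveys that flavor by creating one level of pairing rules online, and is in no way the ESP-based construction or the self-index itself. All names are invented for illustration.

        # Toy online grammar builder: replace consecutive symbol pairs with
        # nonterminals as characters are fed in, one level deep.
        class OnlinePairGrammar:
            def __init__(self):
                self.rules = {}        # (a, b) -> nonterminal id
                self.sequence = []     # compressed top-level sequence
                self.pending = None    # last unpaired symbol

            def feed(self, ch):
                if self.pending is None:
                    self.pending = ch
                    return
                pair = (self.pending, ch)
                if pair not in self.rules:
                    self.rules[pair] = len(self.rules)
                self.sequence.append(self.rules[pair])
                self.pending = None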

    The Frequent Items Problem in Online Streaming under Various Performance Measures

    In this paper, we strengthen the competitive analysis results obtained for a fundamental online streaming problem, the Frequent Items Problem. Additionally, we contribute a more detailed analysis of this problem using alternative performance measures, supplementing the insight gained from competitive analysis. The results also contribute to the general study of performance measures for online algorithms. It has long been known that competitive analysis suffers from drawbacks in certain situations, and many alternative measures have been proposed. However, more systematic comparative studies of performance measures have been initiated recently, and we continue this work, using competitive analysis, relative interval analysis, and relative worst-order analysis on the Frequent Items Problem. Comment: IMADA-preprint-c

    Space-efficient Feature Maps for String Alignment Kernels

    String kernels are attractive tools for analyzing string data. Among them, alignment kernels are known for their high prediction accuracy in string classification when used in combination with SVMs in various applications. However, alignment kernels have a crucial drawback: they scale poorly due to their quadratic computational complexity in the number of input strings, which limits large-scale applications in practice. We address this problem by presenting the first approximation of string alignment kernels, which we call space-efficient feature maps for edit distance with moves (SFMEDM), by leveraging a metric embedding named edit-sensitive parsing (ESP) and feature maps (FMs) of random Fourier features (RFFs) for large-scale string analyses. The original FMs for RFFs consume a huge amount of memory, proportional to the dimension d of the input vectors and the dimension D of the output vectors, which prohibits their use in large-scale applications. We present novel space-efficient feature maps (SFMs) for RFFs that reduce the space from the O(dD) of the original FMs to O(d), with a theoretical guarantee in terms of concentration bounds. We experimentally test SFMEDM on its ability to learn SVMs for large-scale string classification on various massive string datasets, and we demonstrate the superior performance of SFMEDM with respect to prediction accuracy, scalability and computational efficiency. Comment: Full version of ICDM'19 paper
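
    Plain random Fourier features approximate a shift-invariant kernel via z(x) = sqrt(2/D) * cos(Wx + b) with Gaussian W, and storing W costs O(dD). One simple way to get O(d) working space is to regenerate each row of W from a seed on demand, which the sketch below does for a Gaussian kernel. This only illustrates the space/time trade-off and is not the paper's SFM construction or its concentration guarantees.

        # Random Fourier features with projection rows regenerated from seeds,
        # so only O(d) extra memory is held at any time (illustrative only).
        import numpy as np

        def rff_features(x, D=256, gamma=1.0, seed=7):
            d = x.shape[0]
            z = np.empty(D)
            for i in range(D):                  # rebuild row i instead of storing W
                rng = np.random.default_rng(seed * 1000003 + i)
                w = rng.normal(0.0, np.sqrt(2.0 * gamma), size=d)
                b = rng.uniform(0.0, 2.0 * np.pi)
                z[i] = np.cos(w @ x + b)
            return np.sqrt(2.0 / D) * z         # z(x).z(y) ~ exp(-gamma*||x-y||^2)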